The High Performance Embedded Computing Software Initiative (HPEC-SI) Development
Working Group has been creating VSIPL++, a new software standard to promote portability,
productivity, and performance in embedded parallel systems. This standard expands the Vector
Signal Image Processing Library (VSIPL) to encompass parallel systems in C++. HPEC-SI has
contracted with CodeSourcery, LLC, to produce a VSIPL++ Reference Library. The Reference
Library is intended to allow early users to experiment with the functionality of VSIPL++.

This presentation discusses a project to evaluate the VSIPL++ specification by using the
CodeSourcery VSIPL++ Reference Library to implement a part of the current operational signal
processing code for the Deployable Autonomous Distributed System (DADS). The authors of
this presentation have extensive experience in developing and using the original VSIPL library,
working with parallel signal processing algorithms, and developing other HPC middleware and
standards. They have been involved with the development of the VSIPL++ standard, but are not
C++ programming experts. One aspect of this work will be to compare the features and ease of use
of VSIPL and VSIPL++.

The Deployable Autonomous Distributed System (DADS) is an advanced development program,
sponsored by ONR-321, to demonstrate deployable autonomous undersea technology for
operations in coastal waters. The system consists of small acoustic arrays on the ocean floor
with embedded in-node signal processing. Detections are transmitted to the surface using
acoustic modems.

The current DADS acoustic beamforming software is written in ANSI C, is available in both
development and embedded configurations, and is unclassified. The code is sequential, but
future hardware and algorithm upgrades could require parallelization. Of several candidate
software modules, the beamformer was chosen as most appropriate for conversion to VSIPL++.
The DADS beamformer does either adaptive processing (using a minimum variance
distortionless response algorithm) or conventional beamforming. The beamformer source code
consists of about 1000 lines of non-blank, non-comment code.
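
For readers unfamiliar with the two beamformer modes, the sketch below contrasts them in their textbook form: conventional weights w = s / (s^H s) and minimum variance distortionless response (MVDR) weights w = R^-1 s / (s^H R^-1 s), where s is a steering (replica) vector and R a Hermitian covariance matrix. This is not the DADS code. The function names, the use of std::vector, and the plain Gaussian-elimination solver are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cplx = std::complex<double>;
using cvec = std::vector<cplx>;
using cmat = std::vector<cvec>;  // square matrix stored as rows

// Solve R x = b by Gaussian elimination with partial pivoting.
// R and b are passed by value because they are modified during elimination.
cvec solve(cmat R, cvec b) {
    const std::size_t n = b.size();
    assert(R.size() == n);
    for (std::size_t k = 0; k < n; ++k) {
        std::size_t p = k;  // row holding the largest pivot candidate
        for (std::size_t r = k + 1; r < n; ++r)
            if (std::abs(R[r][k]) > std::abs(R[p][k])) p = r;
        std::swap(R[k], R[p]);
        std::swap(b[k], b[p]);
        assert(std::abs(R[k][k]) > 0.0);  // matrix must be non-singular
        for (std::size_t r = k + 1; r < n; ++r) {
            const cplx f = R[r][k] / R[k][k];
            for (std::size_t c = k; c < n; ++c) R[r][c] -= f * R[k][c];
            b[r] -= f * b[k];
        }
    }
    cvec x(n);
    for (std::size_t i = n; i-- > 0;) {  // back substitution
        cplx acc = b[i];
        for (std::size_t c = i + 1; c < n; ++c) acc -= R[i][c] * x[c];
        x[i] = acc / R[i][i];
    }
    return x;
}

// Hermitian inner product a^H b.
cplx dot_h(const cvec& a, const cvec& b) {
    assert(a.size() == b.size());
    cplx acc = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) acc += std::conj(a[i]) * b[i];
    return acc;
}

// Conventional weights: w = s / (s^H s).
cvec conventional_weights(const cvec& s) {
    const cplx norm = dot_h(s, s);
    cvec w(s);
    for (auto& v : w) v /= norm;
    return w;
}

// MVDR weights: w = R^-1 s / (s^H R^-1 s).
cvec mvdr_weights(const cmat& R, const cvec& s) {
    cvec w = solve(R, s);
    const cplx denom = dot_h(s, w);
    for (auto& v : w) v /= denom;
    return w;
}
```

Both weight vectors satisfy the distortionless constraint w^H s = 1, and with R equal to the identity the MVDR weights reduce to the conventional ones. A production beamformer would normally factor R once (for example with a Cholesky decomposition) rather than run a general solver per steering vector.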

The project established a test data set and an environment in which the current DADS
beamforming code could be executed. The code was then rewritten in C++ using the
CodeSourcery VSIPL++ Reference Library and the results of running the new code were
verified. Metrics were recorded on the time to develop the code and the resulting changes in
lines of code from the original version.

We will report on these metrics and other lessons learned. Of particular interest will be whether
addressing real-world algorithms exposes any functional problems or deficiencies of the
VSIPL++ specification that should be addressed by the HPEC-SI Working Group.

VSIPL++ Demonstration


• HPEC-SI is moving VSIPL functionality to object
oriented programming and C++: VSIPL++

• Goal of this demonstration:

- Evaluate the draft VSIPL++ Serial Specification

- Identify both advantages and problems with the VSIPL++
methodology

- Suggest improvements

• Method

- Port a DoD acoustic beamformer algorithm written in
standard C to use VSIPL++ and C++

- Measure and evaluate (when compared to baseline code)

Deployable Autonomous Distributed System (DADS)


• DADS Goals

- Develop and demonstrate deployable autonomous
undersea technology to improve the Navy's
capability to conduct effective Anti-Submarine
Warfare and Intelligence-Surveillance-Reconnaissance
operations in shallow water

• Sponsor: ONR 321

http://www.onr.navy.mil/sci_tech/ocean/321_sensing/info_deploy.htm

DADS Concept


LCS


• Sensors, Arrays & Sources

- Acoustic

- Electromagnetic

• Communication Links

- RF buoys & AUV gliders

- Acoustic modems

• In-Node Signal Processing

- Acoustic, passive & active

- Electromagnetic

- Sensor data fusion

• Master Node

- Network control

- Network data fusion

Field System


DADS Beamformer


• Signal processing program chosen for
conversion is DADS multi-mode beamformer

- Adaptive minimum variance distortionless
response

• Current software is ...

- Sequential ANSI C

- About 1400 lines of C source code

- Pointer-ized, no vectorization

Approach


• Establish test data and environment to execute and
validate current code

• Analyze existing code and data structures

• Vectorize

• Rewrite module using VSIPL++

• Validate VSIPL++ version

• Report specification issues and code metrics

Used a pre-release of the CodeSourcery sequential
VSIPL++ reference implementation, which in
turn uses the VSIPL reference implementation

Deliverables


• Metrics

- SLOC

- Lines changed if appropriate

- Time to develop

- Others

• Report results and lessons learned

- HPEC-SI workshop

- DADS Annual Program Review for ONR, project
personnel, industrial partner (Undersea Sensor
Systems Inc.)

Initial Steps


• Established testable code baseline

- Wrapped module in executable program

- Set up test data file and associated parameters

- Set up validation procedures

• Analyzed baseline code

- Figured out what algorithms were implemented

- Mapped program data flow

Dual Implementations


• Starting from scratch based on analysis of
original program

- Insight, trial approaches to sub-problems

• Incremental modification of original program

- Vectorization

    • Un-pointerize

    • Reorder tests within loops

    • Recast loops into vector and matrix operations

- VSIPL++-ization

- This version chosen for final solution and
metrics

Example of Typical Code


frptr = fr;    // pointer to replica buffer (real)
fiptr = fi;    // pointer to replica buffer (imag)
for (ifreq = ibin1; ifreq <= ibin2; ifreq++)
{
    // produce one row of the weight matrix at a time
    for (iang = 0; iang < nang; iang++)    // loop over bearings
    {
        for (i = 0; i < nh; i++)    // copy a row of the replica
        {
            sr[i] = *frptr;
            si[i] = *fiptr;
            frptr++;
            fiptr++;
        }
        for (i = 0; i < nh; i++)    // loop over hydrophones
        {
            wr[i] = wt[i] * sr[i];
            wi[i] = wt[i] * si[i];
        }
    }
}


for (int ifreq = ibin1; ifreq <= ibin2; ifreq++)
    w = vsip::vmmul<0>(wt, replica.get_xy(ifreq - ibin1));
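
The contrast between the two fragments can be sketched without the VSIPL++ library. The first function below walks split real/imaginary buffers with pointers, in the spirit of the baseline. The second expresses the same work as one operation that scales every row of a complex matrix by a real weight vector, which is how we read the vmmul call. The function names, the row-major std::vector storage, and the nang x nh shape are assumptions for illustration, not the DADS or VSIPL++ interfaces.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<float>;

// Pointer-walking style: the replica for one frequency bin is held in
// separate real (fr) and imaginary (fi) buffers of nang rows by nh samples.
// Each row is scaled element by element with the real weights wt.
void scale_rows_pointer(const float* fr, const float* fi, const float* wt,
                        std::size_t nang, std::size_t nh,
                        float* wr, float* wi) {
    const float* frptr = fr;
    const float* fiptr = fi;
    for (std::size_t iang = 0; iang < nang; ++iang) {   // loop over bearings
        for (std::size_t i = 0; i < nh; ++i) {          // loop over hydrophones
            *wr++ = wt[i] * *frptr++;
            *wi++ = wt[i] * *fiptr++;
        }
    }
}

// Matrix style: one call returns the whole nang x nh result, with
// element (r, c) equal to wt[c] * replica(r, c). Storage is row-major.
std::vector<cplx> scale_rows(const std::vector<float>& wt,
                             const std::vector<cplx>& replica,
                             std::size_t nang) {
    const std::size_t nh = wt.size();
    assert(replica.size() == nang * nh);
    std::vector<cplx> w(replica.size());
    for (std::size_t r = 0; r < nang; ++r)
        for (std::size_t c = 0; c < nh; ++c)
            w[r * nh + c] = wt[c] * replica[r * nh + c];
    return w;
}
```

The caller of the second form no longer manages scratch rows or pointer increments, which is where most of the line-count reduction in a rewrite of this kind would be expected to come from.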

Code Metrics


• Number of files increased from 8 to 14

• SLOC for all source files

- Counting semicolons:

    • Baseline: 887

    • VSIPL++: 630 (-29%)

- Counting non-blank, non-comment lines:

    • Baseline: 1389

    • VSIPL++: 1018 (-27%)

• Heart of the beamformer calculation (all lines):

    • Baseline: 410

    • VSIPL++: 180 (-56%)

• Lines of code changed: Most!
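
The two SLOC conventions and the percentage figures above can be reproduced with a few lines of code. The sketch below is a simplified stand-in, not the tool used for the study: it counts every semicolon (including any inside comments or string literals) and treats only whole-line `//` comments as comment lines, ignoring `/* ... */` blocks. All function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <sstream>
#include <string>

// SLOC by counting semicolons. Simplification: semicolons inside comments
// and string literals are counted too.
std::size_t count_semicolons(const std::string& src) {
    std::size_t n = 0;
    for (char ch : src)
        if (ch == ';') ++n;
    return n;
}

// SLOC by counting non-blank lines that are not whole-line `//` comments.
// Simplification: block comments are not recognized.
std::size_t count_code_lines(const std::string& src) {
    std::istringstream in(src);
    std::string line;
    std::size_t n = 0;
    while (std::getline(in, line)) {
        const std::size_t first = line.find_first_not_of(" \t\r");
        if (first == std::string::npos) continue;         // blank line
        if (line.compare(first, 2, "//") == 0) continue;  // comment-only line
        ++n;
    }
    return n;
}

// Percentage change from baseline to ported code, rounded to the nearest integer.
int percent_change(double baseline, double ported) {
    assert(baseline > 0.0);
    return static_cast<int>(std::lround(100.0 * (ported - baseline) / baseline));
}
```

With the slide's numbers, percent_change(887, 630), percent_change(1389, 1018), and percent_change(410, 180) round to -29, -27, and -56 respectively.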

Memory Size Metrics


• Binary program sizes (statically linked):

              HP-UX/PA-RISC    Red Hat/Pentium

- Baseline    560 KB           700 KB

- VSIPL++     1,800 KB         3,900 KB

• Memory footprint and usage:

- Weren't able to measure this

- VSIPL++ programs might be expected to use
larger structures

    • For example, N vectors become a matrix

- For this program's statically allocated structures
and arrays, it should be a wash

• 64 input sensors, 64 output beams

• 64x64 covariance matrix

• Forward FFTs 64 x 1024

• Inverse FFTs 64 x 1024

• Smaller data set

• Fewer larger objects created, more computing per object


• 14 input sensors, 108 output beams

• 14x14 covariance matrix

• Forward FFTs 14 x 2048

• Inverse FFTs 108 x 2048

• Larger data set

• More smaller objects created, object creation
amortized over less computing

Execution Time Examples


[Figure: bar charts comparing Baseline and VSIPL++ execution times for two cases: 64 sensors, 64 beams; and 14 sensors, 108 beams.]

Profiling Results for PA-RISC


[Figure: stacked bar charts, Baseline vs. VSIPL++, for two cases: 64 sensors, 64 beams, 1024 point FFTs; and 14 sensors, 108 beams, 2048 point FFTs. Axis label: seconds (ticks 0 to 100 on the first panel). Platform: PA-RISC 8600, 550 MHz, HP-UX 11.11, g++ 3.3.2. Legend: other, malloc/free, FFT, main, copy/get/put, other VSIPL, solve, decompose.]

Profiling Results for PowerPC


[Figure: stacked bar charts, Baseline vs. VSIPL++, for two cases: 64 sensors, 64 beams, 1024 point FFTs; and 14 sensors, 108 beams, 2048 point FFTs. Platform: PowerPC, 1.25 GHz, OS X 10.3.4, g++ 3.3. Legend: other, malloc/free, FFT, main, copy/get/put, other VSIPL, solve, decompose.]

Profiling Results for Pentium


[Figure: stacked bar charts, Baseline vs. VSIPL++, for two cases: 64 sensors, 64 beams, 1024 point FFTs; and 14 sensors, 108 beams, 2048 point FFTs. Axis label: seconds (ticks 50 to 350 on the second panel). Platform: Pentium, 450 MHz, Red Hat 8.0, g++ 3.2. Legend: other, malloc/free, FFT, main, copy/get/put, other VSIPL, solve, decompose.]

Object Creation


• Previous experience with VSIPL has shown

- Object creation in inner loops is inefficient

- Solution is early binding / late destroys

• VSIPL++ reference implementation uses
VSIPL library as its compute engine

- Observed similar inner-loop inefficiencies

- C++ new() called to create subviews of data

• A purely C++ VSIPL++ implementation
would avoid some of these problems
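
The "early binding / late destroy" idea can be illustrated with a generic C++ sketch: create a scratch object once before the loop and reuse it, instead of constructing one per iteration. The `Scratch` type, its construction counter, and the sum-of-squares workload are hypothetical stand-ins for a library view or block whose creation allocates memory. Nothing here is VSIPL or VSIPL++ API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scratch object; constructing it allocates memory.
struct Scratch {
    static int constructions;  // how many times a Scratch has been created
    std::vector<float> data;
    explicit Scratch(std::size_t n) : data(n) { ++constructions; }
};
int Scratch::constructions = 0;

// Stand-in for per-block work: sum of squares computed through the scratch buffer.
float process(const std::vector<float>& in, Scratch& tmp) {
    assert(tmp.data.size() >= in.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < in.size(); ++i) {
        tmp.data[i] = in[i] * in[i];
        acc += tmp.data[i];
    }
    return acc;
}

// Object created inside the loop: one allocation per block.
float run_create_in_loop(const std::vector<std::vector<float>>& blocks) {
    float total = 0.0f;
    for (const auto& b : blocks) {
        Scratch tmp(b.size());
        total += process(b, tmp);
    }
    return total;
}

// Early binding, late destroy: create once, reuse in the loop, destroy after it.
float run_create_once(const std::vector<std::vector<float>>& blocks) {
    std::size_t max_len = 0;
    for (const auto& b : blocks)
        if (b.size() > max_len) max_len = b.size();
    Scratch tmp(max_len);
    float total = 0.0f;
    for (const auto& b : blocks) total += process(b, tmp);
    return total;
}
```

Both functions return the same value, but the second performs a single construction regardless of the number of blocks, which matters most when each block carries little computation.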

Overall Issues


• Additional data copying a potential problem

- Improvements in reference library will remove
some of this

• Memory allocation

- A clever implementation might avoid much of this

- Proposal to improve specification so
implementation can avoid calls to C++ new() in
inner loops

• Binary program size for embedded systems

VSIPL++ Specification


• Issues with specification

- I/O for data: fixed in final spec

- Row/column major (matrix layout in memory): fixed in final spec

- Real and imaginary subviews: fixed in final spec

- Sticky subview variables with remapping: proposed fix for final spec

• There were still limitations in the VSIPL++ reference
implementation we used

- Tensors

- Transpose views and operations
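
The row/column-major item concerns how a two-dimensional matrix is laid out in linear memory. As a generic reminder of the two conventions (not the mechanism the specification uses to select a layout), the hypothetical helpers below compute the linear offset of element (r, c) under each ordering.

```cpp
#include <cassert>
#include <cstddef>

// Offset of element (r, c) in a rows x cols matrix stored row by row.
std::size_t offset_row_major(std::size_t r, std::size_t c,
                             std::size_t rows, std::size_t cols) {
    assert(r < rows && c < cols);
    return r * cols + c;
}

// Offset of element (r, c) in a rows x cols matrix stored column by column.
std::size_t offset_col_major(std::size_t r, std::size_t c,
                             std::size_t rows, std::size_t cols) {
    assert(r < rows && c < cols);
    return c * rows + r;
}
```

The ordering decides which index is contiguous in memory, so it affects whether a row or a column can be handed to a vector routine without a strided access or a copy.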

Ongoing VSIPL++ Questions


• Knowing when data is copied and when it
isn't and what we can do about it: there are
subtle C++ distinctions

• Continuing general concern about efficiency

• Use of bleeding-edge C++ features and
compiler compatibility
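
The copy-versus-no-copy question can be made concrete with plain C++. In the sketch below, `subview` returns a non-owning handle that aliases the original storage, while `subcopy` returns independent data. Both names and the `VecView` type are hypothetical; this illustrates the general aliasing distinction and is not a statement of VSIPL++ view semantics.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view over a contiguous range of floats.
struct VecView {
    float* ptr;
    std::size_t len;
    float& operator[](std::size_t i) const {
        assert(i < len);
        return ptr[i];
    }
};

// Returns a view aliasing elements [first, first + len) of v; no data is copied.
// The view is valid only while v is alive and not reallocated.
VecView subview(std::vector<float>& v, std::size_t first, std::size_t len) {
    assert(first + len <= v.size());
    return VecView{v.data() + first, len};
}

// Returns an independent copy of the same range.
std::vector<float> subcopy(const std::vector<float>& v,
                           std::size_t first, std::size_t len) {
    assert(first + len <= v.size());
    return std::vector<float>(v.begin() + first, v.begin() + first + len);
}
```

Writing through the view changes the original vector, whereas writing to the copy does not. Two call sites that look nearly identical can therefore differ in both cost and behavior.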

Our Contributions


• Demonstrated that VSIPL++ can be used for
real DoD application code

• Close look at details improved specification

- Fixing inconsistencies and small errors

- Improving understandability of the spec

• Redesign of the FFT and multiple-FFT API

• Bug fixes in reference implementation

• Improvements to underlying VSIPL reference
library


HPEC  2004  —  DADS  and  VSIPL++ 




Conclusions


• VSIPL++ serial specification has the
functionality to implement a typical DoD
signal processing application


• Resulting code is more understandable and
maintainable


• VSIPL++ can deliver comparable performance